ProDis-ContSHC: learning protein dissimilarity 
measures and hierarchical context coherently for 
protein-protein comparison in protein database 
retrieval 



Background: The need to retrieve or classify protein molecules using structure or sequence-based similarity 
measures underlies a wide range of biomedical applications. Traditional protein search methods rely on a pairwise 
dissimilarity/similarity measure for comparing a pair of proteins. Such pairwise measures neglect the 
distribution of the other proteins in the database and thus cannot satisfy the need for high accuracy in 
retrieval systems. Recent work in the machine learning community has shown that exploiting the global structure 
of the database and learning contextual dissimilarity/similarity measures can improve retrieval performance 
significantly. However, most existing contextual dissimilarity/similarity learning algorithms work in an unsupervised 
manner and do not utilize the known class labels of the proteins in the database. 

Results: In this paper, we propose a novel protein-protein dissimilarity learning algorithm, ProDis-ContSHC. ProDis-ContSHC 
regularizes an existing dissimilarity measure $d_{ij}$ by considering the contextual information of the proteins. 
The context of a protein is defined by its neighboring proteins. The basic idea is that, for a pair of proteins, if their 
contexts $\mathcal{N}(i)$ and $\mathcal{N}(j)$ are similar to each other, the two proteins should also have a high similarity. We 
implement this idea by regularizing $d_{ij}$ by a factor learned from the contexts $\mathcal{N}(i)$ and $\mathcal{N}(j)$. 
Moreover, we divide the context into hierarchical sub-contexts and obtain the contextual dissimilarity vector for each 
protein pair. Using the class label information of the proteins, we select the relevant (pairs of proteins that have the 
same class label) and irrelevant (pairs with different labels) protein pairs, and train an SVM model to distinguish 
between their contextual dissimilarity vectors. The SVM model is further used to learn a supervised regularizing 
factor. Finally, with the new Supervised learned Dissimilarity measure, we update the Protein Hierarchical Context 
Coherently in an iterative algorithm, ProDis-ContSHC. 

We test the performance of ProDis-ContSHC on two benchmark sets, i.e., the ASTRAL 1.73 database and the FSSP/ 
DALI database. Experimental results demonstrate that plugging our supervised contextual dissimilarity measures 
into the retrieval systems significantly outperforms the context-free dissimilarity/similarity measures and other 
unsupervised contextual dissimilarity measures that do not use the class label information. 

Conclusions: Using the contextual proteins with their class labels in the database, we can improve the accuracy of 
the pairwise dissimilarity/similarity measures dramatically for the protein retrieval tasks. In this work, for the first 
time, we propose the idea of supervised contextual dissimilarity learning, resulting in the ProDis-ContSHC 
algorithm. Among different contextual dissimilarity learning approaches that can be used to compare a pair of 
proteins, ProDis-ContSHC provides the highest accuracy. Finally, ProDis-ContSHC compares favorably with other 
methods reported in the recent literature. 



Background 

Proteins are linear chains of amino acids. The polypeptide 
chains are folded into complicated three-dimensional (3D) 
structures. With different structures, proteins are able to 
perform specific functions in biological processes [1-14]. 
To study the structure-function relationship, biologists 
have a high demand for protein structure retrieval systems 
that search for similar sequences or 3D structures [15]. Protein 
pairwise comparison is one of the main functions of 
such retrieval systems [16]. The need to retrieve or classify 
proteins using 3D structure or sequence-based similarity 
underlies many biomedical applications. In drug discovery, 
researchers search for proteins that share specific chemical 
properties as sources for new treatments. In folding simulations, 
similar intermediate structures might be indicative 
of a common folding pathway [17]. 

Related work 

The structural comparison problem in a protein structure 
retrieval system has been extensively studied. Aung and Tan 
[18] proposed a rapid protein structure retrieval system named 
ProtDex2, in which they adopted information retrieval techniques 
to perform rapid database search without accessing 
each 3D structure in the database. The retrieval process 
was based on an inverted-file index constructed on the 
feature vectors of the relationships between the secondary 
structure elements (SSEs) of all the protein structures 
in the database. To evaluate the similarity 
score between a query protein structure and a protein 
structure in the database, they adopted and modified the 
well-known $\sum (tf \times idf)$ scoring scheme commonly used in 
document retrieval systems [19]. In [20,21], a 3D shape-based 
approach was presented by Daras et al. The 
method relied primarily on the geometric 3D structure 
of the proteins, which was produced from the corre- 
sponding PDB files, and secondarily on their primary 
and secondary structures. Additionally, characteristic 
attributes of the primary and secondary structures of 
the protein molecules were extracted, forming attribute- 
based descriptor vectors. The descriptor vectors were 
then weighted and an integrated descriptor vector was 
produced. To compare a pair of protein descriptor vec- 
tors, Daras et al. [20,21] used two metrics of similarity. 
The first one was based on the Euclidean distance [22] 
between the descriptor vectors, and the second one was 
based on Mean Euclidean Distance Measure [20,21]. 



Later, Marsolo and Parthasarathy presented two nor- 
malized, stand-alone representations of proteins that 
enabled fast and efficient object retrieval based on 
sequence or structure information [17,23]. For range 
queries, they specified a range value r and retrieved all 
the proteins from the database that lay within a distance 
r of the query. In their work, distance referred to 
the standard Euclidean distance [22]. In [24], Sael et al. 
introduced a global surface shape representation by 3D 
Zernike descriptors for protein structure similarity 
search. In their study, three distance measures were used 
for comparing 3D Zernike descriptors of protein surface 
shapes, i.e., Euclidean distance, Manhattan distance [25], 
and correlation coefficient-based distance. A fast protein 
comparison algorithm, IR Tableau, was developed by 
Zhang et al. for protein retrieval purposes in [26], which 
leveraged the tableau representation to compare protein 
tertiary structures. IR Tableau compared tableaux using 
feature indexing techniques. In IR Tableau [26], a number 
of similarity functions were applied for comparing a 
pair of protein vectors, i.e., cosine similarity [27], Jaccard 
index [28], Tanimoto coefficient [29], and Euclidean 
distance. 

The basic components of a protein retrieval system 
include a way to represent proteins and a dissimilarity 
measure that compares a pair of proteins. Most of the 
aforementioned studies focus on the feature representation 
of the proteins, while neglecting the comparison of 
the feature vectors. Such studies usually apply a simple 
similarity or dissimilarity measure for the comparison of 
the feature vectors, such as the Euclidean distance 
used in [17,20,21,23,24,26]. Most of the existing protein 
comparison techniques suffer from the following two 
bottlenecks: 

♦ The dissimilarity measure is a pairwise distance 
measure, which is computed considering only the 
query protein $x_0$ and a database protein $x_i$ as $d(x_0, x_i)$. 
It does not consider the other proteins in the database, 
neglecting the effects of the contextual proteins. 
If we consider the distribution of the entire 
protein database $X = \{x_j\}$, $j = 1, \ldots, N$, when computing 
the dissimilarity as $d(x_0, x_i | X)$, the retrieval performance 
may benefit from the contextual proteins $\{x_j\}$, 
$j \neq i$. 

♦ The dissimilarity measure is computed in an unsupervised 
way, which does not use the known information 
of the class labels $L = \{l_i\}$, $i = 1, \ldots, N$, in the database. 
Although we may have no idea about whether $x_0$ and 
$x_i$ belong to the same class (having the same folding 
type etc., $l_0 = l_i$) or not ($l_0 \neq l_i$), we do know some prior 
information about the other proteins, $L$. In all of the previous 
studies, the prior class labels $L$ were not adopted to 
calculate the dissimilarity $d(x_0, x_i)$. 

Due to these two bottlenecks, traditional protein retrieval 
systems using pairwise and unsupervised dissimilarity 
measures usually do not achieve satisfactory performance, 
even though many effective protein feature descriptors 
have been developed and used. In this paper, we investigate the 
dissimilarity measure and propose a novel learning algorithm 
to improve the performance of a given dissimilarity 
measure. 

Recent research in machine learning points out that 
contextual information can be used to improve 
dissimilarity or similarity measures. Such algorithms 
are called contextual or context-sensitive dissimilarity 
learning [30-34]. Unlike the traditional pairwise distance 
$d(x_0, x_i)$, which only considers the two proteins being compared, 
$x_0$ and $x_i$, contextual dissimilarity also considers the contextual 
proteins $X$ when computing the dissimilarity $d(x_0, x_i | X)$. 
The existing contextual similarity learning algorithms 
can mainly be classified into the following two 
categories: 

Dissimilarity regulation 

The first contextual dissimilarity measure (CDM) was 
proposed by Jegou et al. in [30,31], where it 
significantly improved the accuracy of 
image search. The CDM took the local distribution 
of the vectors into account and iteratively estimated 
the distance update terms in the spirit of 
Sinkhorn's scaling algorithm [35], thereby modifying the 
neighborhood structure. This regularization was motivated 
by the observation that a good ranking was usually 
not symmetric in an image search system. In this paper, 
we will focus on this type of contextual dissimilarity 
learning. 

Similarity transduction on graph 

In [32,33], Bai et al. provided a novel perspective to the 
shape retrieval tasks by considering the existing shapes as 
a group and studying their similarity measures to the 
query shape in a graph structure. For a given similarity 
measure, a new similarity was learned through graph 
transduction. The learning was done in an iterative man- 
ner so that the neighbors of a given shape influenced the 
final similarity to the query. The basic idea is 
related to the PageRank algorithm, which forms a foundation 
of Google Web search. This method was further 
improved by Wang et al. in [36]. Similar learning algorithms 
were also used to rank proteins in a protein database 
as in [37,38]. Kuang et al. proposed a general graph-based 
propagation algorithm called MotifProp to detect 
more subtle similarity relationships than the pairwise comparison 
methods. In [38], Weston et al. reviewed RankProp, 
a ranking algorithm that exploited the global 
network structure of similarity relationships among proteins 
in a database by performing a diffusion operation on 
a protein similarity network with weighted edges. 

The drawbacks of the above algorithms are twofold. 
On the one hand, such algorithms do not utilize the class 
label information $L$ of the database objects, and thus work 
in an unsupervised way. The only work that used $L$ is [38]. However, 
the algorithm proposed in [38] had basically the 
same framework as [32,33,37], i.e., the protein label information 
$L$ was only used to estimate the parameters. On the 
other hand, the "context" is fixed in the iterative algorithms 
of most of the transduction methods [32,33,37,38]. 
A better way is to update the context using the learned 
similarity measures as in [30,31]. 

To overcome these drawbacks, we develop a novel contextual 
dissimilarity learning algorithm to improve the performance 
of a protein retrieval system. The novel 
dissimilarity measure is regularized by the dissimilarity of 
the contextual proteins (neighboring proteins), while the 
contextual proteins are updated coherently using the learned 
dissimilarities. The basic idea comes from [39,40], 
which assume that if two local features in two images are 
similar, their contexts are likely to be similar. In comparison 
to [30,31], which use the neighborhood as a single context, we 
partition the neighborhood into several hierarchical sub-contexts 
corresponding to the learned dissimilarities. With 
the sub-contexts, we compute the dissimilarity of the sub-contexts 
of a pair of proteins and construct the hierarchical 
sub-contextual dissimilarity vector. Moreover, using the label 
information $L$, we select pairs of proteins belonging to the 
same class, $\{(x_i, x_j) \mid l_i = l_j\}$, as the relevant protein pairs. 
We also select the irrelevant protein pairs $\{(x_k, x_l) \mid l_k \neq l_l\}$. 

Finally, we train a support vector machine (SVM) [41] 
to distinguish between the relevant and the irrelevant 
protein pairs. The output of the SVM will further be 
used to regularize the dissimilarity in an iterative 
manner. 

Methods 

This section describes our contextual protein-protein 
dissimilarity learning algorithm, which utilizes the con- 
textual proteins and class label information of the data- 
base proteins to index and search protein structures 
efficiently. We will demonstrate that our idea is general 
in the sense that it can be used to improve the existing 
similarity/dissimilarity measures. 



Wang et al. BMC Bioinformatics 2012, 13(Suppl 7):S2 
http://www.biomedcentral.com/1471-2105/13/S7/S2 






Protein structure retrieval framework 

In a protein retrieval system, the query and the database 
proteins are first represented as feature vectors. Here, we 
denote the query protein feature vector as $x_0$ and the database 
protein feature vectors as $X = \{x_1, x_2, \ldots, x_N\}$, where $N$ is 
the number of proteins in the database. Then, based on a 
distance measure $d_{0i} = d(x_0, x_i)$, we compute the distance 
between $x_0$ and all the proteins in the database, i.e., $\{d_{01}, d_{02}, \ldots, 
d_{0N}\}$. The database proteins are then ranked according to 
the distances. The $k$ most similar ones are returned as the 
retrieval results. We illustrate the outline of the protein 
retrieval system in Figure 1. 
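
As an illustration only, the ranking step of this framework can be sketched in a few lines of Python. The function name `retrieve` and the toy distance values are assumptions made for this example and are not part of the original system; database proteins are indexed from 0 here rather than from 1.

```python
def retrieve(query_distances, k):
    """Return the indices of the k database proteins closest to the query.

    query_distances[i] holds d(x_0, x_i) for the i-th database protein.
    """
    ranked = sorted(range(len(query_distances)), key=lambda i: query_distances[i])
    return ranked[:k]


# Toy distances from one query to five database proteins (made-up values).
d0 = [0.9, 0.2, 0.5, 0.1, 0.7]
top3 = retrieve(d0, 3)
print(top3)
```

Any dissimilarity measure can be plugged in by changing how `query_distances` is produced; the ranking itself is unchanged.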

ProDis-ContSHC: the contextual dissimilarity learning 
algorithm 

In this section, we introduce the novel contextual 
protein-protein dissimilarity learning algorithm. We first 
give the definition of the hierarchical context of a protein, 
which will be used to compute the contextual dissimilarity 
and regularize the dissimilarity measure. Then 
a more discriminative regularization factor is learned 
using the class labels of the database proteins. Finally, 
we propose the Supervised regulating of Protein-protein 
Dissimilarity and updating of the Hierarchical Context 
Coherently in an iterative manner, resulting in the ProDis-ContSHC 
algorithm. 

Figure 1 Flowchart of protein retrieval systems. 

Using hierarchical context to regularize the dissimilarity 
measure 

Here, we define the context of a protein $x_i$ as its $K$ nearest 
neighbors $\mathcal{N}(i)$. The dissimilarity between two 
contexts is measured by the contextual dissimilarity as 

$$r_{ij} = \frac{1}{K^2} \sum_{m \in \mathcal{N}(i),\, n \in \mathcal{N}(j)} d_{mn} \qquad (1)$$



The contextual dissimilarity is illustrated in Figure 2(a). 

Furthermore, instead of averaging all the pairwise dissimilarities 
between the two contexts $\mathcal{N}(i)$ and $\mathcal{N}(j)$, 
we propose the hierarchical context by splitting the context 
$\mathcal{N}(i)$ into $P$ "sub-contexts" $\mathcal{N}_p(i)$, $p = \{1, \ldots, P\}$, 
according to their distances to $x_i$. To be more specific, 
the sub-context $\mathcal{N}_p(i)$ is defined as 

$$\mathcal{N}_p(i) = \{x_j \mid x_j \text{ is among the } k'\text{-th to } k''\text{-th nearest neighbors of } x_i \text{ according to } [d_{ij}],\ j \in \{1, \ldots, i-1, i+1, \ldots, N\}\} \qquad (2)$$



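where $k' = (p - 1) \times k$, $k'' = (p - 1) \times k + k$, $k$ is the 
size of a sub-context, and $P$ is the number of sub-contexts. 
In this way, we can compute the contextual 
dissimilarity by averaging the dissimilarities of the sub-contexts 
as 

$$r_{ij} = \frac{1}{P} \sum_{p=1}^{P} \left[ \frac{1}{k^2} \sum_{m \in \mathcal{N}_p(i),\, n \in \mathcal{N}_p(j)} d_{mn} \right] = \frac{1}{P} \sum_{p=1}^{P} d_{ij}(p) \qquad (3)$$

where $\{ d_{ij}(p) = \frac{1}{k^2} \sum_{m \in \mathcal{N}_p(i),\, n \in \mathcal{N}_p(j)} d_{mn},\ p = 1, \ldots, P \}$ is the 
hierarchical sub-contextual dissimilarity. Figure 2(b) 
illustrates the idea of sub-contextual dissimilarity. 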
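
The construction in (2) and (3) can be made concrete with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the dissimilarity matrix is a plain list of lists, and the p-th sub-context is taken to be the neighbors ranked (p - 1)k + 1 to pk, so that every sub-context holds exactly k proteins.

```python
def sub_contexts(D, i, k, P):
    """Split the k*P nearest neighbours of protein i into P groups of size k.

    Assumption: protein i itself is excluded and ties keep index order.
    """
    order = sorted((j for j in range(len(D)) if j != i), key=lambda j: D[i][j])
    return [order[p * k:(p + 1) * k] for p in range(P)]


def context_vector(D, i, j, k, P):
    """Return [d_ij(1), ..., d_ij(P)]: mean dissimilarity between matching sub-contexts."""
    ctx_i, ctx_j = sub_contexts(D, i, k, P), sub_contexts(D, j, k, P)
    return [sum(D[m][n] for m in a for n in b) / (len(a) * len(b))
            for a, b in zip(ctx_i, ctx_j)]


def contextual_dissimilarity(D, i, j, k, P):
    """r_ij as in (3): the average of the P sub-contextual dissimilarities."""
    return sum(context_vector(D, i, j, k, P)) / P


# Toy data: six "proteins" placed on a line, dissimilarity = absolute difference.
coords = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in coords] for a in coords]
print(context_vector(D, 0, 3, k=1, P=2))            # proteins from different clusters
print(contextual_dissimilarity(D, 0, 1, k=1, P=2))  # proteins from the same cluster
```

In the toy data the two clusters are far apart, so a pair drawn from different clusters receives a much larger contextual dissimilarity than a pair drawn from the same cluster.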

Intuitively, if the contexts of two proteins are dissimilar 
to each other ($r_{ij}$ is higher than the average), they should 
have a higher dissimilarity value, and vice versa. We 
implement this by multiplying $d_{ij}$ by a coefficient, which is the 
ratio of $r_{ij}$ to the average of all the contextual dissimilarities, 
$\bar{r} = \frac{1}{N^2} \sum_{i,j} r_{ij}$: 

$$d^*_{ij} = d_{ij} \times \frac{r_{ij}}{\bar{r}} = d_{ij} \times \delta_{ij} \qquad (4)$$

Here, $\delta_{ij} = r_{ij} / \bar{r}$ is a regularization factor for $d_{ij}$, with 
which we can improve $d_{ij}$ by its contextual information. 
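
A small numerical sketch of the unsupervised regularization in (4) is given below. The function name `regularize` and the toy matrices are assumptions made for illustration; the average is taken over all entries of the contextual dissimilarity matrix, which is one reading of "the average of all the contextual dissimilarity".

```python
def regularize(D, R):
    """Scale each d_ij by delta_ij = r_ij / r_bar, where r_bar is the mean of R."""
    n = len(D)
    r_bar = sum(sum(row) for row in R) / (n * n)
    return [[D[i][j] * R[i][j] / r_bar for j in range(n)] for i in range(n)]


# Toy 2 x 2 example (made-up values): D holds d_ij, R holds r_ij.
D = [[0, 2], [2, 0]]
R = [[1, 3], [3, 1]]
regularized = regularize(D, R)
print(regularized)
```

Here the off-diagonal pair has a contextual dissimilarity above the average, so its dissimilarity is increased, as the text describes.
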

Figure 2 Illustration of context-based dissimilarity and hierarchical context-based dissimilarity. The two proteins $x_i$ and $x_j$, on which the 
dissimilarity is to be measured, are in the first row. The nearest neighbors of these two proteins are listed below them as the context, 
respectively. (a) The traditional context $\mathcal{N}(i)$; (b) The proposed hierarchical context $\mathcal{N}_p(i)$, $p = \{1, 2, 3\}$. 



Moreover, this procedure can be carried out in an iterative 
manner. We can use the regularized dissimilarity measure 
$d^*_{ij}$ to re-define the new hierarchical context $\mathcal{N}_p(i)$. 
In this way, we can learn the protein-protein dissimilarity 
$d^*_{ij}$ and the hierarchical context $\mathcal{N}_p(i)$ coherently. 

Supervised regularization factor learning 

We utilize the label information $L = \{l_1, \ldots, l_N\}$ of 
the database proteins to learn a better regularization factor 
$\delta_{ij}$. The class information is adopted in both the intra-class 
and the inter-class dissimilarity computation to 
maximize the Fisher criterion [42] for protein class separability. 
First, we select a number of protein pairs 
$\{\gamma = (i, j) \mid i, j = 1, \ldots, N\}$. For each pair, we compute the 
hierarchical contextual dissimilarities and organize them 
as a $P$-dimensional dissimilarity vector $\mathbf{d}_\gamma = [d_{ij}(1)\ d_{ij}(2)\ 
\ldots\ d_{ij}(P)]^T$, as shown in Figure 3. Then, inspired by the 
score fusion rule [43,44], using $L$, we further label each 
pair $\gamma = (i, j)$ as a relevant pair, $y_\gamma = +1$, if $l_i = l_j$, or as an 
irrelevant pair, $y_\gamma = -1$, otherwise. 
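
The labeling rule for protein pairs can be written down directly. The following sketch is illustrative only; `label_pairs` is a hypothetical helper, and it enumerates every unordered pair, whereas in practice only a subset of pairs would be selected.

```python
from itertools import combinations


def label_pairs(labels):
    """Map each unordered pair (i, j), i < j, to +1 (same class) or -1 (different class)."""
    return {(i, j): (1 if labels[i] == labels[j] else -1)
            for i, j in combinations(range(len(labels)), 2)}


# Toy class labels for three database proteins (made-up fold names).
y = label_pairs(["a", "a", "b"])
print(y)
```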

Now, with the training samples $T = \{(\mathbf{d}_\gamma, y_\gamma)\}$, $\gamma = 1, \ldots, 
{}^{N}C_{2}$, we train a binary SVM [41] classifier to distinguish 
between the relevant pairs and the irrelevant pairs. The 
publicly available package SVMlight [45] is applied to 
implement the SVM on our training set $T$. This package 
allows us to optimize a number of parameters and offers 
the option of using different kernel functions to obtain the 
best classification performance [46]. The separating 
hyperplane generated by the SVM model is given by 

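$$f(\mathbf{d}) = \mathbf{d} \cdot \mathbf{w} + b \qquad (5)$$

where $\mathbf{w}$ is a vector orthogonal to the hyperplane and 
$b$ is the offset; they are chosen to minimize $\|\mathbf{w}\|^2$ subject to the 
following conditions: 

$$y_\gamma (\mathbf{d}_\gamma \cdot \mathbf{w} + b) \ge 1 \qquad (6)$$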

for all $1 \le \gamma \le {}^{N}C_{2}$, where ${}^{N}C_{2}$ is the total number of 
examples (protein pairs). An SVM model with a linear 
decision boundary is shown in Figure 3 to distinguish 
the relevant protein pairs from the irrelevant ones. Note 
that not all the ${}^{N}C_{2}$ possible protein pairs need 
to be included to train the SVM model (5). For any pair 
of proteins $(x_i, x_j)$, after we compute its contextual dissimilarity 
vector $\mathbf{d}_{ij}$, the trained SVM classifier is applied 
to get the distance of this point to the margin boundary 
of the SVM as $f_{ij} = f(\mathbf{d}_{ij})$. Evidently, $f_{ij}$ reflects the 
dissimilarity of the contexts of this pair of proteins: relevant pairs 
($y_\gamma = +1$) lie on the positive side, so a larger $f_{ij}$ indicates more similar contexts. 
Thus, it can be used to form a regularization factor as 
$$\delta'_{ij} = \exp\left[ -\frac{\mathbf{d}_{ij} \cdot \mathbf{w} + b}{\sigma} \right] \qquad (7)$$

where $\sigma$ is a parameter of the factor. With this regularization 
factor learned from the contextual proteins, 
we regularize the dissimilarity of the protein pair $(x_i, x_j)$ 
as 

$$d^*_{ij} = d_{ij} \times \delta'_{ij} \qquad (8)$$
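
To show how (5), (7) and (8) fit together, the sketch below evaluates the regularization factor for two contextual dissimilarity vectors. It does not train an SVM: the weight vector `w`, the offset `b` and the value of `sigma` are hypothetical numbers chosen for illustration, standing in for the output of a trained linear model.

```python
import math


def decision_value(d, w, b):
    """f(d) = d . w + b, the linear decision function of (5)."""
    return sum(di * wi for di, wi in zip(d, w)) + b


def supervised_factor(d, w, b, sigma):
    """delta'_ij = exp(-f(d_ij) / sigma), as in (7)."""
    return math.exp(-decision_value(d, w, b) / sigma)


# Hypothetical trained parameters: small contextual dissimilarities score positive.
w, b, sigma = [-1.0, -1.0], 2.0, 1.0
near = supervised_factor([0.5, 0.5], w, b, sigma)  # f = +1, pair looks relevant
far = supervised_factor([2.0, 2.0], w, b, sigma)   # f = -2, pair looks irrelevant
print(near, far)
```

With these assumed parameters the factor is below 1 for the pair on the relevant side of the hyperplane and above 1 for the pair on the irrelevant side, so (8) shrinks the former dissimilarity and enlarges the latter.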

Updating the context and dissimilarity coherently 

With the learned dissimilarity measure $d^*_{ij}$, we can re-define 
the "context" of a protein $x_i$ according to 
its dissimilarity to all the other proteins, 
$d^*_{ij}$, $j \in \{0, \ldots, i - 1, i + 1, \ldots, N\}$. The new "hierarchical 
context" relying on $d^*_{ij}$ is denoted as 
$\mathcal{N}^*_p(i)$, $p = \{1, \ldots, P\}$. In this way, we can develop an 
iterative algorithm that learns $d^*_{ij}$ and 
$\mathcal{N}^*_p(i)$, $p = \{1, \ldots, P\}$, coherently. Since $\mathcal{N}^*_p(i)$ implicitly 
depends on $d^*_{ij}$ through the nearest neighbors of 
$x_i$, we use a fixed-point recursion method [47] to 
solve for $d^*_{ij}$. In each iteration, $\mathcal{N}^*_p(i)$ is first computed 
by using the previous estimate of $d^*_{ij}$, which is then 
updated by multiplying by the regularization factor as 
in (8). The iterations are carried out $T$ times, as 
given in Algorithm 1. 

With the learned dissimilarity matrix $D^{(T+1)}$, we use 
$D^{(T+1)}(0; 1, \ldots, N)$ as the dissimilarity between the query 
protein $x_0$ and the database proteins $\{x_1, \ldots, x_N\}$. Thus 
we can rank the database proteins in ascending 
order. 

Efficient implementation of ProDis-ContSHC 

The proposed learning algorithm is time-consuming and is 
therefore not suitable, as it stands, for real-time protein retrieval 
systems. Here we propose several techniques to significantly 
improve the efficiency of the algorithm. 

♦ Similar to [33], in order to increase the computational 
efficiency, it is possible to run ProDis-ContSHC 
on only part of the database of known 
proteins. Hence, for each query protein $x_0$, we first 
retrieve the $N' \ll N$ most similar proteins, and perform 
ProDis-ContSHC to learn the dissimilarity 
matrix of size $(N' + 1) \times (N' + 1)$ for only those proteins. 
Then we calculate the new dissimilarity 
measure $D_{(N'+1) \times (N'+1)}$ for only those $(N' + 1)$ 
proteins. Here, we assume that all the relevant proteins 
will be among the top $N'$ most similar proteins. 
This strategy is illustrated in Figure 4(a) and 4(b). 

♦ Most dissimilarity and similarity measures 
are symmetric, i.e., $d_{ij} = d_{ji}$. As can be observed 
in (13), the regularization of $d_{ij}$ is also symmetric. 
Therefore, it is possible to develop an efficient learning 
algorithm by using this property. In the algorithm, 
all the computation results for $(i, j)$ (such as 
$d_{ij}$ and $\delta_{ij}$) can be used directly for $(j, i)$. In this way, 
we can save almost half of the computational time, 
as shown in Figure 4(c). 

♦ A bottleneck of ProDis-ContSHC may be the training 
procedure for the SVM model in each iteration. 
For a database of $N$ proteins belonging to $C$ classes, 
there are ${}^{N}C_{2}$ protein pairs, of which $\sum_{c=1}^{C} {}^{N_c}C_{2}$ are 
relevant pairs, while $\sum_{c=1}^{C} \sum_{c' > c} N_c \times N_{c'}$ are irrelevant 
pairs, where $C$ is the number of protein 
classes and $N_c$ is the number of proteins in the $c$-th 
class ($\sum_{c=1}^{C} N_c = N$). There might be a huge number 
of protein pairs available for the SVM training. However, 
it is not necessary to include all of them in the 
training process. One can select a small but equal 
number of relevant and irrelevant pairs to 
train the SVM classifier. This is an effective way to 
reduce the training time of the SVM. 

Figure 4 Efficient implementation of ProDis-ContSHC. (a) Performing ProDis-ContSHC on the original matrix of size $(N + 1) \times (N + 1)$ from 
the entire dataset; (b) Performing ProDis-ContSHC on a subset of the database proteins, i.e., a dissimilarity matrix of size $(N' + 1) \times (N' + 1)$; (c) 
Using the symmetry property of the dissimilarity matrix to reduce the training time. 
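
The pair counts and the balanced selection described in the last item of the list above can be checked with a short Python sketch. The helper names, the fixed random seed and the toy labels are assumptions for illustration; the sketch enumerates all pairs for clarity, which would itself be too costly for a large database.

```python
import random
from collections import Counter
from itertools import combinations
from math import comb


def pair_counts(labels):
    """Return (number of relevant pairs, number of irrelevant pairs)."""
    sizes = list(Counter(labels).values())
    relevant = sum(comb(n, 2) for n in sizes)
    irrelevant = sum(a * b for a, b in combinations(sizes, 2))
    return relevant, irrelevant


def sample_balanced_pairs(labels, per_kind, seed=0):
    """Draw the same number of relevant and irrelevant pairs without replacement."""
    rng = random.Random(seed)
    pairs = list(combinations(range(len(labels)), 2))
    rel = [p for p in pairs if labels[p[0]] == labels[p[1]]]
    irr = [p for p in pairs if labels[p[0]] != labels[p[1]]]
    m = min(per_kind, len(rel), len(irr))
    return rng.sample(rel, m), rng.sample(irr, m)


# Toy labels: three classes of sizes 3, 2 and 1 (made-up data).
labels = ["a", "a", "a", "b", "b", "c"]
print(pair_counts(labels))
rel, irr = sample_balanced_pairs(labels, per_kind=3)
print(rel, irr)
```

For the toy labels the two counts sum to the total number of unordered pairs of six proteins, which is a quick consistency check on the formulas.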

Algorithm 1 ProDis-ContSHC: Supervised Learning 
of Protein Dissimilarity and Updating Hierarchical Context 
Coherently. 

Require: Input $D = [d_{ij}]_{(N+1) \times (N+1)}$: matrix of size 
$(N+1) \times (N+1)$ of pairwise protein feature distances, where $x_0$ 
is the query protein and $\{x_1, \ldots, x_N\}$ are the database 
proteins; 
Require: Input $k$: size of the hierarchical sub-context; 
Require: Input $P$: number of hierarchical sub-contexts; 

Initialize dissimilarity matrix: $D^{(1)} = D$; 
for $t = 1, \ldots, T$ do 

  Update the hierarchical context for each protein 
  $x_i$: $\mathcal{N}_p^{(t)}(i)$, $(p = 1, \ldots, P)$, 

  $$\mathcal{N}_p^{(t)}(i) = \{x_j \mid x_j \text{ is among the } k'\text{-th to } k''\text{-th nearest neighbors of } x_i \text{ according to } D^{(t)}(i; 0, \ldots, N)\} \qquad (9)$$

  where $k' = (p - 1) \times k$, $k'' = (p - 1) \times k + k$, and 
  $D^{(t)}(i; 0, \ldots, N) = [d_{i0}^{(t)}, \ldots, d_{iN}^{(t)}]$. 

  Compute the contextual dissimilarity vector 
  $\mathbf{d}_{ij}^{(t)}$ for each pair of proteins $(i, j)$, $i, j \in \{0, \ldots, N\}$: 

  $$\mathbf{d}_{ij}^{(t)} = [d_{ij}^{(t)}(1)\ d_{ij}^{(t)}(2)\ \ldots\ d_{ij}^{(t)}(P)]^T \qquad (10)$$

  where $d_{ij}^{(t)}(p) = \frac{1}{k^2} \sum_{m \in \mathcal{N}_p^{(t)}(i),\, n \in \mathcal{N}_p^{(t)}(j)} d_{mn}^{(t)}$. 

  Select relevant and irrelevant protein pairs and label 
  them as $y_\gamma = +1$ and $y_\gamma = -1$ respectively; train an 
  SVM model for their contextual dissimilarity vectors 
  $\mathbf{d}_\gamma^{(t)}$ as 

  $$f^{(t)}(\mathbf{d}) = \mathbf{d} \cdot \mathbf{w}^{(t)} + b^{(t)} \qquad (11)$$

  Compute the distance to the SVM margin boundary 
  for the contextual dissimilarity vector $\mathbf{d}_{ij}^{(t)}$ of each 
  pair of proteins as $f_{ij}^{(t)} = f^{(t)}(\mathbf{d}_{ij}^{(t)})$, and set a regularization 
  factor for this pair of proteins: 

  $$\delta_{ij}^{(t)} = \exp\left( -\frac{f_{ij}^{(t)}}{\sigma} \right) \qquad (12)$$

  Update the pairwise protein dissimilarity measures: 
  for $i = 0, 1, \ldots, N$ do 
    for $j = 0, 1, \ldots, N$ do 

      $$d_{ij}^{(t+1)} = d_{ij}^{(t)} \times \delta_{ij}^{(t)} \qquad (13)$$

    end for 
  end for 

  $D^{(t+1)} = [d_{ij}^{(t+1)}]_{(N+1) \times (N+1)}$. 

end for 

Output the dissimilarity matrix: $D^{(T+1)}$. 
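
The control flow of Algorithm 1 can be sketched compactly in Python. This is a simplified illustration, not the authors' implementation: SVM training is replaced by a hypothetical callback `fit_scorer` that returns a decision function, the sub-contexts are taken as consecutive groups of k neighbors, and the symmetry and subset speed-ups are omitted.

```python
import math


def sub_contexts(D, i, k, P):
    """Indices of the P sub-contexts of protein i, each holding k neighbours."""
    order = sorted((j for j in range(len(D)) if j != i), key=lambda j: D[i][j])
    return [order[p * k:(p + 1) * k] for p in range(P)]


def context_vector(D, ctx_i, ctx_j):
    """P-dimensional vector of mean sub-context dissimilarities, as in (10)."""
    return [sum(D[m][n] for m in a for n in b) / (len(a) * len(b))
            for a, b in zip(ctx_i, ctx_j)]


def prodis_contshc(D, k, P, T, sigma, fit_scorer):
    """Repeat context update, scoring and rescaling T times.

    fit_scorer is a hypothetical stand-in for SVM training: it receives
    {(i, j): vector} and returns a function f(vector) -> float.
    """
    n = len(D)
    D = [row[:] for row in D]  # work on a copy of the input matrix
    for _ in range(T):
        ctx = [sub_contexts(D, i, k, P) for i in range(n)]
        vecs = {(i, j): context_vector(D, ctx[i], ctx[j])
                for i in range(n) for j in range(n)}
        f = fit_scorer(vecs)
        D = [[D[i][j] * math.exp(-f(vecs[i, j]) / sigma) for j in range(n)]
             for i in range(n)]
    return D


# Toy run: five proteins on a line and a stand-in scorer that always returns 1.
coords = [0.0, 1.0, 2.0, 10.0, 11.0]
D0 = [[abs(a - b) for b in coords] for a in coords]
D2 = prodis_contshc(D0, k=1, P=2, T=2, sigma=1.0,
                    fit_scorer=lambda vecs: (lambda d: 1.0))
print(D2[0][1])
```

With the constant stand-in scorer every entry is simply scaled by the same factor in each iteration; a real scorer trained on labeled pairs would scale relevant and irrelevant pairs differently.
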

Benchmark sets 

To evaluate the proposed ProDis-ContSHC algorithm, 
we conduct experiments on two different benchmark 
sets, i.e., the ones used in [21] and [26], respectively. 

ASTRAL 1.73 protein domain dataset 

Following [26], we use the following database and 
queries as our first benchmark set: 

Database The ASTRAL 1.73 [48] 95% sequence-identity 
non-redundant data set is used as the protein database. 
We generate our index database from the tableau data 
set published by Stivala et al. [49], which contains 
15,169 entries. 

Queries A query data set containing 200 randomly 
selected protein domains is used in our experiment. For 
each query, a list that contains all the proteins in the 
respective index database is returned with the ranking 
scores. 

We generate a vector of features x for a given protein 
based on its tableau representation [49]. 

FSSP/DALI protein dataset 

To evaluate the performance of the proposed methods, a 
portion of the FSSP database [50] is selected as in [21]. 
This dataset has 3,736 proteins classified into 30 classes. 
It is constructed according to the DALI algorithm 
[51,52]. The number of proteins per class varies from 
2 to 561. For protein feature representation, the following 
two features are extracted from the 3D structure 
and the sequence of a protein as in [20,21]: 

♦ The Polar-Fourier transform, resulting in the FT02 
features; 

♦ Krawtchouk moments, resulting in the Kraw00 
features. 



The descriptor vectors are weighted and an integrated 
descriptor vector is produced as x, which will be used 
for the protein retrieval tasks. 

Results and discussion 

Results on ASTRAL 1.73 dataset 

To compare a query protein $x_0$ to a protein $x_i$ in the 
ASTRAL 1.73 dataset, we compute the cosine similarity 
[27] as the baseline similarity measure, as in [26]. Cosine 
similarity [27] simply calculates the cosine of the angle 
between the two vectors $x_i$ and $x_j$: 

$$s_{ij} = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|} \qquad (14)$$

A higher cosine similarity score implies a smaller 
angle between the two vectors. Although ProDis-ContSHC 
is proposed to learn the protein-protein dissimilarity 
$d_{ij}$, it can be extended easily to learn a similarity $s_{ij}$ 
as well. The only difference is to set the regularization 
factor as $\delta'_{ij} = \exp(f_{ij} / \sigma)$ instead of $\delta'_{ij} = \exp(-f_{ij} / \sigma)$ in (7). 
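
For reference, the baseline measure in (14) amounts to the following short Python function. The function name and the toy vectors are assumptions for illustration; the actual feature vectors are derived from tableau representations as described above.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Toy vectors (made-up values): parallel vectors score 1, orthogonal ones 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```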

ROC curve and precision-recall curve performance 

SCOP [53] fold classification is used as the ground truth to 
evaluate the performance of the different methods. To 
compare the accuracy fairly, we use the receiver operating 
characteristic (ROC) curve [54], the area under the ROC 
curve (AUC) [54], and the precision-recall curve [55]. 
Given a query protein $x_0$ that belongs to the SCOP fold 
$l_0$, the top $k$ proteins returned by the search algorithm are 
considered as the hits. The remaining proteins are considered 
as the misses. For the $i$-th ranked protein $x_i$ belonging to the 
SCOP fold $l_i$, if $l_i = l_0$ and $i \le k$, the protein $x_i$ is defined as 
a true positive (TP). On the other hand, if $l_i \neq l_0$ and $i \le k$, 
$x_i$ is defined as a false positive (FP). If $l_i \neq l_0$ and $i > k$, $x_i$ is 
defined as a true negative (TN). Otherwise, $x_i$ is a false 
negative (FN). Using these definitions, we can then compute 
the true positive rate (TPR), the false positive 
rate (FPR), recall and precision as follows: 



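$$\mathrm{TPR} = \frac{TP}{P} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{N} = \frac{FP}{FP + TN} \qquad (15)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP} \qquad (16)$$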



$\mathrm{TPR}_k$, $\mathrm{FPR}_k$, $\mathrm{Recall}_k$, and $\mathrm{Precision}_k$ are calculated for 
all $1 \le k \le N$, where $N$ is the size of the database. The 
ROC defines a curve of points with $\mathrm{FPR}_k$ as the abscissa 
and $\mathrm{TPR}_k$ as the ordinate. Precision-recall defines a 
curve with $\mathrm{Recall}_k$ and $\mathrm{Precision}_k$ as abscissa and ordinate, 
respectively. We use the area under the ROC curve 
(AUC) as a single-figure measurement for the quality of 
a ROC curve [54], and use the AUC averaged over all 
the queries to evaluate the performance of the method. 
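
The evaluation quantities defined in (15) and (16) can be computed from a ranked result list as sketched below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the input is a list of booleans saying whether the protein at each rank shares the query's fold, the AUC is obtained by the trapezoidal rule, and the list must contain at least one relevant and one irrelevant protein.

```python
def roc_points(ranked_relevance):
    """Return the (FPR_k, TPR_k) points for k = 0, 1, ..., N."""
    positives = sum(ranked_relevance)
    negatives = len(ranked_relevance) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for relevant in ranked_relevance:
        if relevant:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def precision_recall_at_k(ranked_relevance, k):
    """Return (Precision_k, Recall_k) for the top k returned proteins."""
    tp = sum(ranked_relevance[:k])
    return tp / k, tp / sum(ranked_relevance)


# Toy ranking of six database proteins for one query (made-up relevance flags).
ranking = [True, True, False, True, False, False]
pts = roc_points(ranking)
print(auc(pts))
print(precision_recall_at_k(ranking, 3))
```

Averaging `auc` over all queries gives the single-figure score used in the comparison.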

To demonstrate the contribution of the supervised 
learning idea, we also compare ProDis-ContSHC with 
its unsupervised counterpart, i.e., the contextual dissimilarity 
algorithm based on unsupervised learning, ProDis-ContHC. 
ProDis-ContHC is also applied to improve 
the cosine similarity. We also compare with the widely 
used contextual dissimilarity measure (CDM) [30,31], 
which takes into account the local distribution of 
the vectors and iteratively estimates distance update 
terms in the spirit of Sinkhorn's scaling algorithm, 
thereby modifying the neighborhood structures. 

The performance of the different methods is compared in
Figure 5. Figure 5(a) shows the ROC curves of the original
cosine similarity and of its versions improved by the three
contextual similarity learning algorithms on the ASTRAL 1.73
[48] 95% dataset, with different numbers of proteins returned
to each query. Figure 5(a) shows that the TPR of all the
methods increases as the FPR grows. The reason is that, for a
fixed number of queries, when the number k of proteins
returned to each query is very small, the returned proteins
are too few to "represent" the class features of the query,
which causes a low TPR. Meanwhile, in this situation, most of
the returned proteins belong with high confidence to the same
class as the query, resulting in a low FPR. Moreover, the TPR
is almost 100% when the FPR exceeds 50%. The ROC curve of
ProDis-ContSHC completely encloses the ROC curves of the
other three methods, which implies that ProDis-ContSHC is the
best method among the four. It also indicates that supervised
learning is better than unsupervised learning for this
purpose. ProDis-ContHC is the second best method among the
four, which demonstrates the contribution of the hierarchical
sub-context idea to the traditional contextual dissimilarity
measures. The overall AUC results are listed in Table 1, from
which similar conclusions can be drawn. The AUC of
ProDis-ContSHC (0.973) is close to 1, which means that
ProDis-ContSHC works very well on this dataset. We further
compare the four methods by their precision-recall curves,
shown in Figure 5(b). The proposed contextual similarity
learning algorithms clearly outperform the traditional
methods. ProDis-ContSHC, again, is consistently the best
method among the four.
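
The relation between the number k of returned proteins and the ROC curve can be sketched as follows. This is a minimal illustration, not the evaluation code used in the experiments: it assumes that true and false positives are pooled over all queries at each k, and the names `roc_curve_over_k` and `auc` are hypothetical.

```python
def roc_curve_over_k(ranked_relevance):
    """ROC points obtained by sweeping the number k of returned proteins.

    ranked_relevance: one list of booleans per query, ordered from the best
    match to the worst; True marks a database protein of the query's class.
    All lists are assumed to have the same length (the database size).
    """
    n_queries = len(ranked_relevance)
    n_db = len(ranked_relevance[0])
    n_relevant = sum(sum(row) for row in ranked_relevance)
    n_irrelevant = n_queries * n_db - n_relevant
    fpr, tpr = [0.0], [0.0]  # k = 0: nothing returned
    for k in range(1, n_db + 1):
        tp = sum(sum(row[:k]) for row in ranked_relevance)
        fp = n_queries * k - tp
        tpr.append(tp / n_relevant)
        fpr.append(fp / n_irrelevant)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2.0
               for i in range(len(fpr) - 1))
```

For small k only the top-ranked proteins are counted, so both rates are low; as k approaches the database size both rates approach 1, which is the behaviour described above.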

Regarding efficiency, in this experiment the learning time of
ProDis-ContSHC is longer than that of ProDis-ContHC and CDM.
This is because, in each iteration of the learning algorithm,
a quadratic programming problem with many training



Figure 5 Performance of similarity measures on the ASTRAL
1.73 90% dataset. (a) The ROC curves of the original similarity
measure and of the measures improved by ProDis-ContSHC,
ProDis-ContHC, and CDM, respectively (panel title: "ROC curves
for 200 query set in ASTRAL 1.73 95% data set"). (b) The
precision-recall curves of the original similarity measure and
of the measures improved by ProDis-ContSHC, ProDis-ContHC, and
CDM, respectively.



Table 1 Performance of different retrieval methods on
the ASTRAL 1

Method                                            AUC

IR Tableau: Cosine Similarity + ProDis-ContSHC    0.973
IR Tableau: Cosine Similarity + ProDis-ContHC     0.961
IR Tableau: Cosine Similarity + CDM [30,31]       0.954
IR Tableau: Cosine Similarity [26]                0.948
Tableau Search [56]                               0.871
QP Tableau [49]                                   0.925
Yakusa [57]                                       0.950
SHEBA [58]                                        0.941
VAST [59,60]                                      0.890
TOPS [61,62]                                      0.871

AUC results for QP Tableau [49], SHEBA [58] and VAST [59,60] are taken from
[49], which used exactly the same query set and the same dataset as our
experiments.

protein pairs has to be solved to train the SVM. In
addition, computing the regularization factor of the
supervised similarity learning algorithm requires more
function evaluations.

We also compare the proposed algorithms with six other
protein retrieval methods, i.e., tableau search [56],
QP tableau [49], Yakusa [57], SHEBA [58], VAST
[59,60], and TOPS [61,62]. The overall AUC values are
shown in Table 1. Tableau-feature-based methods do not
always outperform the other methods: tableau search, for
example, has a lower AUC (0.871) than Yakusa, SHEBA, and
VAST. Among the existing tableau-feature-based methods, IR
tableau outperforms the others. Yakusa and SHEBA have
comparable performance. As seen in Table 1, the AUC of the
proposed algorithms is higher than that of all the other
methods.

Improving different similarity measures via contextual 
dissimilarity learning algorithms 

To further evaluate the robustness of our method, we
test the behavior of ProDis-ContSHC and the other contextual
similarity learning algorithms on different similarity
measures. A group of experiments is conducted on the
ASTRAL 1.73 95% dataset with the following similarity
measures:

♦ The cosine similarity [27] as introduced in the pre- 
vious section. 

♦ The Jaccard index [28]: it is defined as the size of
the intersection divided by the size of the union of
two sets, i.e.,

$$J(x_i, x_j) = \frac{|x_i \cap x_j|}{|x_i \cup x_j|} \tag{17}$$



♦ The Tanimoto coefficient [29]: it is a generalization
of the Jaccard index, defined as

$$T(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\|^2 + \|x_j\|^2 - x_i \cdot x_j} \tag{18}$$



♦ Squared Euclidean distance [22]: it is another
means of measuring similarity of proteins,

$$d(x_i, x_j) = \sum_{m} \left( x_i(m) - x_j(m) \right)^2 \tag{19}$$

where x_i(m) is the m-th element of vector x_i.
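
The four base measures listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: vectors are plain lists of numbers, the Jaccard index is computed on explicit sets (for instance the indices of the non-zero features, which is an assumption and not a detail given here), and all function names are hypothetical.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x, y):
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))


def jaccard_index(a, b):
    """Size of the intersection over size of the union of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def tanimoto_coefficient(x, y):
    """Generalization of the Jaccard index to real-valued vectors."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))
```

On binary vectors the Tanimoto coefficient coincides with the Jaccard index of the two sets of non-zero positions, which is the sense in which it generalizes the latter.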

ProDis-ContSHC, ProDis-ContHC, and the CDM algorithm are
applied to improve each of these similarity measures. The AUC
values of the corresponding retrieval systems are plotted in
Figure 6. In general, ProDis-ContSHC yields the largest
improvement over the original similarity measure. The only
exception is the Tanimoto coefficient, on which ProDis-ContSHC
has a slightly lower AUC than ProDis-ContHC and an AUC
comparable to that of the CDM. One possible reason is that the
supervised classifier fails to capture the real distribution
of the contextual similarity. ProDis-ContHC, on the other
hand, also performs better than the CDM algorithm and the
original similarity measures. This supports our previous
conclusions: hierarchical sub-contextual information improves
the traditional context-based similarity measures, and
supervised learning further improves the accuracy for most of
the input similarity measures.

Results on FSSP/DALI dataset 

Unlike the similarity measure used in the last experiment,
here we use the Euclidean distance [22] as the baseline
dissimilarity measure for comparing a pair of proteins, as in
[20,21]. In this way, we can see how our algorithms work with
both similarity and dissimilarity measures. For a query
protein x_0, the pairwise Euclidean distances d_{0i},
i = 1, 2, ..., N, are ranked in ascending order, and the top k
proteins are returned as the retrieval results. To evaluate
the performance of the proposed algorithms, we test them on
both the protein retrieval and the protein classification
tasks, following [20,21].

Performance on protein retrieval

The effectiveness of the proposed dissimilarity learning
algorithm is first evaluated in terms of the performance on
the protein retrieval task. In this case, each protein
x_i ∈ X of the dataset is used as a query x_0, and the
retrieved proteins are ranked according to the shape
dissimilarity d_{0j} to the query, where
j = 1, 2, ..., i - 1, i + 1, ..., N. We again use the
precision-recall curve to demonstrate the performance of the
proposed methods, where precision is the proportion of the
retrieved proteins that are relevant to the query and recall
is the proportion of the relevant proteins in the entire
dataset that are retrieved.
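
The ranking step and the two definitions above can be sketched for a single query as follows. This is a minimal illustration; the function name `precision_recall_curve` and its argument names are hypothetical, and the query itself is assumed to have been removed from the two input lists.

```python
def precision_recall_curve(distances, relevant):
    """Precision and recall after each returned protein, for one query.

    distances: dissimilarity d_{0j} between the query and each database protein.
    relevant:  booleans, True if the protein has the same class as the query.
    Assumes at least one relevant protein exists in the database.
    """
    # Rank database proteins from the most to the least similar.
    order = sorted(range(len(distances)), key=lambda j: distances[j])
    total_relevant = sum(relevant)
    precision, recall, hits = [], [], 0
    for k, j in enumerate(order, start=1):
        if relevant[j]:
            hits += 1
        precision.append(hits / k)            # relevant among the k retrieved
        recall.append(hits / total_relevant)  # retrieved among all relevant
    return precision, recall
```

Averaging these per-query curves over all queries gives one curve per method, one possible way of producing plots of the kind shown in Figure 7.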

To test the robustness and consistency of our methods,
we apply them to three different protein descriptor vectors,
i.e., Daras et al.'s FT02, Kraw00, and FT02&Kraw00 [20,21]
geometric descriptor vectors. We also apply the unsupervised
version of our algorithm, ProDis-ContHC, and the CDM algorithm
to the same dissimilarity measure and the same descriptor
vectors to compare with ProDis-ContSHC. Figure 7 shows the
precision-recall curves for the different algorithms on the
different protein descriptor vectors. As mentioned in [20,21],
there is always a tradeoff between the precision and recall
values. This is clearly shown in Figure 7(a), 7(b), and 7(c),
in which the algorithms reach their peak precision values
at the smallest recall values. It can be seen that
ProDis-ContSHC has a clearly better performance than any
other method, whereas ProDis-ContHC is the second

Figure 6 Performance of similarity measures on different base measures on the ASTRAL 1.73 90% dataset. The four base measures
being tested are cosine similarity [27], the Jaccard index [28], the Tanimoto coefficient [29], and the Euclidean distance [22].



best one. This is consistent with what is observed in
the last experiment, in which a similarity measure is used.
Therefore, our algorithms consistently improve both the
similarity and the dissimilarity measures tested. Among the
three protein descriptor vectors, ProDis-ContSHC performs the
best on the combined vector, i.e., Kraw00&FT02. This is
because this vector not only employs the context, but
also their relevant information to predict the relationship
between the query and the database proteins.

Performance on protein classification

The performance of the method is also evaluated in
terms of the overall classification accuracy [20,21]. To
be more specific, each protein x_i in the database is in turn
removed from the database and used as the query x_0
("leave-one-out" experiment [63]), and a dissimilarity measure
is applied between x_0 and the remaining proteins. A class
label l_0 is then assigned to the query x_0 according to the
label of the nearest database protein.
The overall classification accuracy is given by:

$$\text{Overall Classification Accuracy} = \frac{\text{Number of correctly predicted proteins}}{\text{Total number of proteins in the database}} \tag{20}$$
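
The leave-one-out protocol and the accuracy defined above can be sketched as a nearest-neighbour classification over a precomputed dissimilarity matrix. The function name `leave_one_out_accuracy` is hypothetical, and ties between equally near proteins are broken by database order, which is an assumption.

```python
def leave_one_out_accuracy(dist, labels):
    """Overall classification accuracy of a leave-one-out 1-nearest-neighbour test.

    dist:   square matrix (list of lists), dist[i][j] is the dissimilarity
            between proteins i and j under the measure being evaluated.
    labels: class label of each protein.
    """
    n = len(labels)
    correct = 0
    for i in range(n):
        # Protein i acts as the query and is excluded from the database.
        nearest = min((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / n
```

Any of the compared measures (the Euclidean distance or one of its contextually improved versions) can be plugged in through `dist`.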

We again conduct this experiment with the three
descriptor vectors, i.e., FT02, Kraw00, and FT02&Kraw00.
The overall classification accuracy is shown in Table 2.
ProDis-ContSHC consistently achieves an accuracy above 99%
on all three descriptor vectors. Each dissimilarity measure
achieves its highest accuracy on Kraw00&FT02. Among the four
dissimilarity measures, ProDis-ContSHC has the highest
accuracy, whereas ProDis-ContHC is the second best one. This
ranking has therefore been observed for both similarity and
dissimilarity measures, on different datasets and with
different descriptor vectors.

Conclusions 

We have introduced in this paper a novel contextual
dissimilarity learning algorithm for protein-protein
comparison in protein database retrieval tasks. Its strength
resides in the use of the hierarchical context between a
pair of proteins and their class label information. In
extensive experiments, this novel algorithm has been shown to
outperform the traditional context-based methods and its own
unsupervised version.

We formulate the protein dissimilarity learning problem
as a context-based classification problem. Under such a
formulation, we regularize the protein pairwise dissimilarity
in a supervised way rather than in the traditional
unsupervised way. To the best of our knowledge, this is
the first study on supervised contextual dissimilarity
learning. We propose a novel algorithm, ProDis-ContSHC,
which updates a protein's hierarchical sub-context and the
dissimilarity measure coherently. The regularization

Figure 7 Performance of dissimilarity measures on the FSSP/
DALI dataset. (a) The precision-recall curves of the original
dissimilarity measure and of the measures improved by ProDis-ContSHC,
ProDis-ContHC, and CDM, respectively, with the descriptor
vector FT02&Kraw00. (b) The precision-recall curves with the
descriptor vector FT02. (c) The precision-recall curves with the
descriptor vector Kraw00.



factors are learned based on the classification of the rele- 
vant and the irrelevant protein pairs. The algorithm works 
in an iterative manner. 

Table 2 Overall classification accuracy using different
protein descriptors and the Euclidean distance measure

Dissimilarity measure                   Descriptors
                                        FT02     Kraw00   Kraw00&FT02

Euclidean Distance + ProDis-ContSHC     0.9925   0.9954   0.9971
Euclidean Distance + ProDis-ContHC      0.9890   0.9917   0.9928
Euclidean Distance + CDM [30,31]        0.9869   0.9895   0.9909
Euclidean Distance [20,21]              0.9850   0.9879   0.9890



Experimental results demonstrate that the supervised
method is almost always better than its unsupervised
counterpart across the databases and feature vectors tested.
The proposed method, even though mainly presented for protein
database retrieval tasks, can be easily extended to other
tasks, such as RNA sequence-structure pattern indexing [64],
retrieval of high throughput phenotype data [65], and
retrieval of genomic annotation from large genomic position
datasets [66]. The approach may also be extended to database
retrieval and pattern classification problems in other
domains, such as medical image retrieval [67-69],
speech recognition, and texture classification [70].